Today we will go through a series of steps that are familiar to many scientist doing empirical work. We will pretend to have collected some data1, clean them, visualize them, and analyze them using common statistical models. We will do all these things with the tidyverse, a collection of R packages for data manipulation and plotting that aims at being easily readable not only for machines, but also for humans2. These operations will be run in an environment that ensures computational reproducibility: if you have the initial data set, you will be able to reproduce the final results with a mouse click.

On your computer, you should already have installed R and RStudio (if not, follow the instructions for your operating system here and here).

Setup

One possible way to work on a project is to open a new .R file and write your code there. However, there are a few disadvantages:

  • your collaborators may not have your same folder structure, i.e., where you store data and code
    • you could specify a working directory with setwd(), e.g., setwd("NAME/OF/YOUR/DIRECTORY")
  • you cannot know the state of your collaborators’ working environment, e.g., what functions or packages they have already loaded
    • you could include, at the beginning of your script, the following lines:
      • cat("\014") to clear the console
      • rm(list = ls()) to clear the environment (but it does not unload active packages!)

You could do all those things, but… should you?

Jenny Bryan explains the reasons (without threats to commit arson) in this blog post. The main idea is to package code in a self-contained environment that everyone (including your future self) can run without changing their own environment.

R Projects

Following Jenny Brian’s suggestion (and to avoid her wrath) we can create a project directly from RStudio:

  • click the “File” menu button, then “New Project”
  • click “New Directory”
  • click “New Project”
  • type in the name of the directory to store your project, e.g., “YOURNAME_MPI2020_intro_tidyverse”
  • click the “Create Project” button

If you check the directory you just created, you will see a file called YOURNAME_MPI2020_intro_tidyverse.Rproj. If your RStudio session is still open, look at the top of the active window: the R Project is already active.

Install Packages

It is time to install the first package of this session: here3. For a brief overview of what the package can do, see this GitHub repository (authored by Jenny Bryan!).

Once the package is installed, we can load it:

This package is very useful to keep raw data, analysis scripts, preprocessed data, and reports in separate subdirectories (which is what you should always aim for). In which directory are we now?

## [1] "/home/aschetti/Documents/Projects/MPI2020_intro_tidyverse"

We are now in the directory of the .Rproj file. However, our data are in a subdirectory called raw-data. here allows you to link to that directory without actually going there.

## [1] "/home/aschetti/Documents/Projects/MPI2020_intro_tidyverse/raw-data"

In other words, we are still inside our R Project directory (/home/aschetti/Documents/Projects/MPI2020_intro_tidyverse) but we can load and save files inside the respective subfolders.

Let’s verify that the data we need are actually in the raw-data subdirectory:

## [1] "MixedAttitude.dat"

Yes, the file MixedAttitude.dat is what we are going to play with today.

Exercise 1

Install and load the tidyverse.

Data manipulation

The data are in .dat format… how can we load them?4 The function read_tsv from the readr package (part of the tidyverse) can load this kind of tab-separated files:

## Parsed with column specification:
## cols(
##   ssj = col_double(),
##   gender = col_double(),
##   beerpos = col_double(),
##   beerneg = col_double(),
##   beerneut = col_double(),
##   winepos = col_double(),
##   wineneg = col_double(),
##   wineneut = col_double(),
##   waterpos = col_double(),
##   waterneg = col_double(),
##   waterneu = col_double()
## )

For additional details on the arguments of this function, type ?read_tsv in the console.

Now the data are in a tibble, the equivalent of a data.frame in the tidyverse. Type att (the name of the data set in our environment) in the console to see the first lines.

To see the full data set, click on att in the Environment window (by default on the upper right corner).

These are the data of 21 participants who saw neutral, positive, or negative advertisement of beer, wine, and water in 3 separate sessions. They were asked to rate the drinks on a scale ranging from – 100 (dislike very much) through 0 (neutral) to 100 (like very much). Researchers wanted to reduce binge drinking in teenagers, so they hoped that pairing negative imagery with alcoholic beverages would lower these ratings5.

When the data set is big, it is useful to have a general idea of what it contains. The function glimpse

Filtering

Looking at the data, you can see that the ratings of participant #99 are a bit strange… perhaps there was a technical problem, i.e., the rating scale was out of the -100/100 range.

Let’s filter this participant out. More specifically, we keep all participants that are not participant #99.

The symbol %>% is the pipe operator , which allows to serially concatenate functions applied to the same data frame. Read it as “and then”. In the example below, we called the data frame att and then applied the filter function to discard participant #99.

How to verify that the filtering worked? You could click on att in the Environment window and look for participant #99, but it can be problematic with big data sets. An easy and generalizable way is to check if the filtered data set still contains participant #99:

The tibble is empty, confirming the deletion.

Variable selection

For the exercises in this course, we will only need a subset of variables in this data set.

Researchers were interested in reducing binge drinking, so we will focus our attention on negative imagery and one alcoholic beverage. We choose beer. We also need control conditions, i.e., neutral imagery and water.

In the next code snippet, we simultaneously select and rename the variables we want to keep.

Recoding

Let’s look at the variable gender:

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

A string of 1s and 2s. It’s easy to get a glimpse of this variable because there are not many observations. When you have more observations, better use unique:

## [1] 1 2

What type of variable is it?

## [1] "numeric"

Class integer has only integer values (i.e., no decimals or complex numbers).

There are many variable types in R. An interesting one is factor, i.e., variables which can have only a limited number of different values… they are basically categorical variables. In our case, it would make sense to consider gender as a factor with two levels, male and female. Let’s transform it:

Now we have two variables with identical information… let’s get rid of gender:

Converting

Data frames can be in wide or long format:

  • wide format: participants as rows, conditions as columns (e.g., SPSS)
  • long format: every row represents an observation belonging to a particular condition

Our data are now in wide format. However, the packages we are going to use in this course need data in long format. Let’s convert from wide to long:

The column ratings contains the values of our dependent variable, whereas condition contains all our conditions.

This experiment has two independent variables, drink (beer or water) and imagery (neutral or negative). In our analysis, we wish to know the separate contribution of these two independent variables and their interaction.
So, we need to separate condition into 2 variables:

Saving

The original data are saved in a .dat file. This format is not what the cool kids use. Let’s save our processed data as .csv.

If you want to read from .csv, use read_csv.

Concatenate operations

Thanks to the versatility of the tidyverse (especially the pipe operator), all the above operations can be performed in one go:

Summary

It is often required to provide summary statistics of the data. How to do it with tidyverse?

## # A tibble: 4 x 9
## # Groups:   drink [2]
##   drink imagery      n  mean    sd   sem   min   max ci.95
##   <chr> <chr>    <int> <dbl> <dbl> <dbl> <int> <int> <dbl>
## 1 beer  negative    20  4.45 17.3   3.87   -19    30  7.58
## 2 beer  neutral     20 10    10.3   2.30   -10    28  4.51
## 3 water negative    20 -9.2   6.80  1.52   -20     5  2.98
## 4 water neutral     20  2.35  6.84  1.53   -13    12  3.00

Exercises

Exercise 1

Do the following operations in one go:

  • load the original data (MixedAttitude.dat)
  • convert gender to a categorical variable with 2 levels:
    • 1 –> female
    • 2 –> male
  • eliminate from the data set the following variables: beerpos, beerneg, beerneut, winepos, waterpos
  • using the function rename, rename the variables you kept:
    • ssj –> participant
    • wineneg –> wine_negative
    • wineneut –> wine_neutral
    • waterneg –> water_negative
    • waterneu –> water_neutral
  • filter out all participants who rated water preceded by neutral imagery as lower than -10
  • convert the data set to long format
  • separate conditions into 2 variables (drink and imagery) and convert them to factors
  • save the data set as data_attitude_exercise.csv
  • separately for drink, imagery, and gender, calculate the following summary statistics:
    • number of observations
    • median
    • median absolute deviation
    • minimum value
    • maximum value
  • display the results in console

Plotting

Plotting is one of the most satisfying things to do in R, especially if you use the package ggplot2… part of the tidyverse!

In a nutshell, ggplot2 allows you to build plots iteratively using a series of layers. You start with a data set and specify its aesthetics (e.g., which variable should be represented on the x-axis?). Later, you can add layers with annotations, statistical summaries, and so on.

Bar plot

Let’s start by creating a basic bar plot. We will use the data frame with the summary statistics that we just created.

What the hell is that?!? Oh no, it’s a stacked bar graph! We don’t want that! How can we get separate bars for negative and neutral imagery?

Better. Let’s add outlines.

Something is missing… ah, error bars! Display 95% confidence intervals.

These colors are hideous… let’s use a more decent color palette.

The package viridis uses colors that are easier to distinguish for people with colorblindness.

Let’s add a final cosmetic touch.

RDI plots

No matter how pretty a bar graph is, it remains a suboptimal way of displaying your data.
A better solution is to use RDI plots, which show Raw data, Descriptive & Inferential statistics.

The following graph shows:

  • points representing the raw data
  • smoothed densities
  • box and whisker plot:
    • vertical bars: medians
    • boxes: upper and lower quartiles
    • whiskers: minimum and maximum values
att.filter.select.recode.long.sep %>% # we need the complete data frame, not the summary
  unite("condition", c(drink, imagery)) %>% # paste 'drink' and 'imagery' columns into 'condition'
  # base plot
  ggplot(
    .,
    aes(
      x = condition,
      y = ratings
    )
  ) +
  # box and whisker plot
  geom_boxplot(
    alpha = 1, # boxes: transparency
    size = .5, # boxes: line thickness
    outlier.alpha = 0
  ) + # outliers: transparency
  stat_boxplot(
    geom = "errorbar", # whiskers
    size = .5, # whiskers: line thickness
    width = .25
  ) + # whiskers: width
  # violin plot
  geom_violin(aes(fill = condition), # density: color fill
    color = "transparent", # outline: color
    alpha = .25
  ) + # density: transparency
  # jittered data points
  geom_jitter(
    size = 3, # point: size
    alpha = .3, # point: transparency
    position = position_jitter(width = .1)
  ) + # point: jitter
  scale_fill_viridis(
    option = "viridis", # color palette for all fills
    discrete = TRUE
  ) +
  scale_color_viridis(
    option = "viridis", # color palette for all outlines
    discrete = TRUE
  ) +
  scale_x_discrete(
    limits = # x-axis: set variable order
    c(
      "water_neutral", "water_negative",
      "beer_neutral", "beer_negative"
    )
  ) +
  scale_y_continuous(
    name = "", # y-axis: title
    limits = c(-25, 35), # y-axis: min/max values
    breaks = seq(-25, 35, 5)
  ) + # y-axis: tick marks
  geom_hline(
    yintercept = seq(-25, 35, 5), # reference lines
    linetype = "dotted", # reference lines: type
    colour = "#999999", # reference lines: color
    size = .8, # reference lines: thickness
    alpha = .5
  ) + # reference lines: transparency
  ggtitle("mean ratings") + # plot title
  theme_classic(base_size = 18) + # custom theme (resize text)
  theme(
    legend.position = "none", # no legend
    plot.title = element_text(size = 24, hjust = .5)
  ) # resize and center title

If you don’t want to waste time doing it yourself, I recommend the yarrr package.

The pirateplot function creates a graph showing:

  • points representing the raw data
  • smoothed densities
  • vertical bars showing central tendencies
  • rectangles representing inference intervals (e.g., 95% confidence intervals)

The graphs above suggest that negative imagery may have influenced ratings to water more than to beer.

Exercises

Exercise 2.1

Compute the following operations in one go:

  • load data_attitude_exercise.csv (use a function from the tidyverse instead of read.table)
  • separately for drink, imagery, and gender, calculate the summary statistics that you will need for the plot (i.e., means and 95% confidence intervals)
  • create bar graphs:
    • separate bars for negative and neutral imagery
    • separate graphs for male and female participants, placed horizontally (hint: use facet_grid)
    • black outlines
    • error bars displaying 95% confidence intervals
    • viridis color palette
    • no title on the x-axis
    • limit the y-axis from -20 to +20, with tick marks every 5
    • add a plot title
    • text size in the plot: 18 points
    • text size of the title: 32 points
    • title in the center
    • legend position in the center of the graph (not overlapping with the bars)
    • no legend title
    • no legend background

Exercise 2.2

Compute the following operations in one go:

  • load data_attitude_exercise.csv
  • using the yarrr package, create an RDI graph similar to the bar graph of Exercise 2.1 (i.e., separate graphs for female and male participants)
  • modify it as you please (check the pirateplot options), e.g., change color palette (hint: type piratepal("all"))

Data analysis

The aim of this study was to assess whether negative imagery would influence the likeness ratings of alcoholic beverages. It’s time to statistically verify this hypothesis.

We will run a 2 (drink: water, beer) x 2 (imagery: neutral, negative) repeated measures ANOVA on these ratings.

For this demonstration I chose the package afex, authored by Henrik Singmann.

Repeated measures ANOVA

The code below shows how to run an ANOVA with this versatile and user-friendly package.

## Anova Table (Type 3 tests)
## 
## Response: ratings
##          Effect    df    MSE         F ges p.value
## 1         drink 1, 19 252.84   8.97 ** .19    .007
## 2       imagery 1, 19  82.92 17.63 *** .13   .0005
## 3 drink:imagery 1, 19  26.66    6.75 * .02     .02
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

The results show:

  • a statistically significant main effect of drink (F(1, 19) = 8.97, p = 0.007, \(\eta\)2G = 0.19)
  • a statistically significant main effect of imagery (F(1, 19) = 17.63, p = 0.00049, \(\eta\)2G = 0.13)
  • a statistically significant drink x imagery interaction (F(1, 19) = 6.75, p = 0.018, \(\eta\)2G = 0.02)

Paired contrasts

The drink x imagery interaction is statistically significant… let’s run paired comparisons. afex uses functions included in the emmeans and multcomp packages.

## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
##     Indicator predictors are now treated as 2-level factors by default.
##     To revert to old behavior, use emm_options(cov.keep = character(0))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Linear Hypotheses:
##                                     Estimate Std. Error t value Pr(>|t|)    
## negative,beer - neutral,beer == 0     -5.550      2.604  -2.131   0.1423    
## negative,beer - negative,water == 0   13.650      4.329   3.153   0.0185 *  
## negative,beer - neutral,water == 0     2.100      4.997   0.420   0.9613    
## neutral,beer - negative,water == 0    19.200      2.934   6.544   <0.001 ***
## neutral,beer - neutral,water == 0      7.650      3.034   2.521   0.0691 .  
## negative,water - neutral,water == 0  -11.550      2.044  -5.652   <0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

The output has all you need to report in a publication… well, almost. Something is missing.

Effect sizes

It is good practice to report effect sizes along with p-values, so that readers can make their own mind with respect to the importance of the observed effects. Even better, you could report confidence intervals around effect sizes, so that readers can have a clear picture of the precision of your estimation.

I particularly like the idea of bootstrapping effect sizes, a better approach when the data are known not to be normally distributed or when the distribution is unknown6. I will show you how to do it using the package bootES7.

We will compute Hegdes’s g, an unbiased estimate of \(\delta\) (for details, see here)

Because our dependent variable consists of ratings collected from the same participants in different conditions (i.e., repeated measures), we must first manually compute the difference scores between our contrasts of interest (see the output of the paired comparisons above).

Now we can calculate the standardized effect size for each difference scores. Using functions in the purrr package (part of the tidyverse!), we will create separate lists for each difference score and calculate bootstrapped Hegdes’s g for each of them.

The result is a data frame containing 6 lists (i.e., the number of diff.conds), each containing the results of the bootstrapping procedure. As an example, let’s look at the list containing the bootstrapped Hegdes’s g for the difference ratings of beer after seeing negative vs. neutral imagery:

## 
## 95.00% bca Confidence Interval, 5000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## -0.457      -0.927      0.052       -0.044      0.256

Summary of paired contrasts

The final step is to summarize the information stored in different lists in one single data frame that can later be converted into a table and cleaned up for publication. This is one way to do it:

Exercises

Exercise 3.1

Run the same analyses using the exercise data set (i.e., wine instead of beer):

  • 2 (drink) x 2 (imagery) repeated measures ANOVA on likeness ratings
  • paired contrasts
  • bootstrapped effect sizes (calculate Pearson’s r instead of Hegdes’s g)
  • summary table of paired contrasts

Exercise 3.2

Run a 2 (gender) x 2 (drink) x 2 (imagery) mixed ANOVA on likeness ratings:

  • remember: gender is a between-subject factor!
  • paired contrasts: test only difference ratings between female and male participants (hint: see example here):
    • wine_negative
    • wine_neutral
    • water_negative
    • water_neutral
  • bootstrapped effect sizes (Cohen’s d) of the paired comparisons of interest
  • summary table of paired contrasts

Conclusion

Hopefully, I managed to give you a glimpse of the versatility, readability, and user-friendliness of the tidyverse. Enjoy your tidy new life!


Thanks!




  1. Hopefully you’re not familiar with pretending to collect data… if so, please tell your story by writing a book.

  2. As you may imagine, today we won’t have time to cover all the amazing things you can do with these packages. Also, they are a gift that keeps on giving: I have been using them for a while and I keep discovering useful functions. If you want to learn more, read R for Data Science by Garrett Grolemund and Hadley Wickham.

  3. It is good to know where your packages are. Type .libPaths() in the RStudio console to find out.

  4. When you have no idea where to start, Google (or its privacy-friendly alternative StartPage) and StackOverflow are your friends!

  5. This is a modified version of a data set included in the book Discovering Statistics Using R (Field, Miles & Field, 2012).

  6. When you have no idea where to start, Google (or its privacy-friendly alternative StartPage) and StackOverflow are your friends!

  7. This is a modified version of a data set included in the book Discovering Statistics Using R (Field, Miles & Field, 2012).

 

Antonio Schettino

schettino@eur.nl